Scaling Single-Program Performance on Large-Scale Chip Multiprocessors

نویسندگان

  • Meng-Ju Wu
  • Donald Yeung
چکیده

Due to power constraints, computer architects will exploit TLP instead of ILP for future performance gains. Today, 4–8 state-of-the-art cores or 10s of smaller cores can fit on a single die. For the foreseeable future, the number of cores will likely double with each successive processor generation. Hence, CMPs with 100s of cores–so-called large-scale chip multiprocessors (LCMPs)–will become a reality after only 2 or 3 generations. Unfortunately, simply scaling the number of on-chip cores alone will not guarantee improved performance. In addition, effectively utilizing all of the cores is also necessary. Perhaps the greatest threat to processor utilization will be the overhead incurred waiting on the memory system, especially as on-chip concurrency scales to 100s of threads. In particular, remote cache bank access and off-chip bandwidth contention are likely to be the most significant obstacles to scaling memory performance. This paper conducts an in-depth study of CMP scalability for parallel programs. We assume a tiled CMP in which tiles contain a simple core along with a private L1 cache and a local slice of a shared L2 cache. Our study considers scaling from 1–256 cores and 4–128MB of total L2 cache, and addresses several issues related to the impact of scaling on off-chip bandwidth and on-chip 1 communication. In particular, we find off-chip bandwidth increases linearly with core count, but the rate of increase reduces dramatically once enough L2 cache is provided to capture inter-thread sharing. Our results also show for the range 1–256 cores, there should be ample on-chip bandwidth to support the communication requirements of our benchmarks. Finally, we find that applications become off-chip limited when their L2 cache miss rates exceed some minimum threshold. Moreover, we expect off-chip overheads to dominate on-chip overheads for memory intensive programs and LCMPs with aggressive cores.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Distance Analysis for Large - Scale Chip Multiprocessors

Title of dissertation: REUSE DISTANCE ANALYSIS FOR LARGE-SCALE CHIP MULTIPROCESSORS Meng-Ju Wu, Doctor of Philosophy, 2012 Dissertation directed by: Professor Donald Yeung Department of Electrical and Computer Engineering Multicore Reuse Distance (RD) analysis is a powerful tool that can potentially provide a parallel program’s detailed memory behavior. Concurrent Reuse Distance (CRD) and Priva...

متن کامل

Control Loop Feedback Mechanism for Generic Array Logic Chip Multiprocessor

Control Loop Feedback Mechanism for Generic Array Logic Chip Multiprocessor is presented. The approach is based on control-loop feedback mechanism to maximize the efficiency on exploiting available resources such as CPU time, operating frequency, etc. Each Processing Element (PE) in the architecture is equipped with a frequency scaling module responsible for tuning the frequency of processors a...

متن کامل

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

As compared to a complex single processor based system, on-chip multiprocessors are less complex, more power efficient, and easier to test and validate. In this work, we focus on an on-chip multiprocessor where each processor has a local memory (or cache). We demonstrate that, in such an architecture, allowing each processor to do off-chip memory requests on behalf of other processors can impro...

متن کامل

on Power - Efficient Fault Tolerant Micro architecture for Chip Multiprocessors

Relentless scaling of silicon fabrication technology coupled with lower design tolerances are making ICs increasing susceptible to wear-out related permanent faults as well as transient faults. A well known technique for tackling both transient and permanent faults is redundant execution, specifically space redundancy, wherein a program is executed redundantly on different processors, pipelines...

متن کامل

System Synthesis for Optically- Connected, Multiprocessors On-chip

Optical interconnects are being considered as a possible solution to the wellknown problems of scaling in VLSI interconnects. Along with enabling higher speed interconnects, optics allows the construction of highly connected and irregular networks that are streamlined for particular applications. Using these networks, it is possible to implement application mappings that allow flexible, single-...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009